Deep Dive into Data Driven Maps
- pin or dot maps
- category maps
- proportional symbol & color maps
- graduated symbol & color maps
- choropleth maps
Taste of Spatial Analysis
Spring 2018
Deep Dive into Data Driven Maps
Taste of Spatial Analysis
Open RStudio & a Script file
Set working directory
Load libraries (rgeos is new; install it if needed)
Load our data from part 1
Open the slide deck to follow along: http://bit.ly/geo_r_pt2
Let's load the R libraries we will use
library(sp)
library(rgdal)
library(rgeos)
library(tmap)
library(RColorBrewer)
library(ggplot2)
library(ggmap)
## leaflet and maptools may also be needed
## Warning: package 'rgdal' was built under R version 3.4.3
## rgdal: version: 1.2-16, (SVN revision 701)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.1.3, released 2017/01/20
## Path to GDAL shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/gdal
## GDAL binary built with GEOS: FALSE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: /Library/Frameworks/R.framework/Versions/3.4/Resources/library/rgdal/proj
## Linking to sp version: 1.2-5
## Warning: package 'rgeos' was built under R version 3.4.2
## rgeos version: 0.3-26, (SVN revision 560)
## GEOS runtime version: 3.6.1-CAPI-1.10.1 r0
## Linking to sp version: 1.2-5
## Polygon checking: TRUE
## Warning: package 'tmap' was built under R version 3.4.3
## Google Maps API Terms of Service: http://developers.google.com/maps/terms.
## Please cite ggmap if you use it: see citation("ggmap") for details.
Load the data if it is not already loaded
# setwd()
sfhomes <- read.csv('data/sf_properties_25ksample.csv')
sfhomes15 <- subset(sfhomes, as.numeric(SalesYear) == 2015)
class(sfhomes15)
## [1] "data.frame"
sfhomes15_sp <- sfhomes15
coordinates(sfhomes15_sp) <- c('lon','lat') # ORDER MATTERS!!
proj4string(sfhomes15_sp) <- CRS("+init=epsg:4326")
sfboundary <- readOGR(dsn="data",layer="sf_boundary")
## OGR data source with driver: ESRI Shapefile
## Source: "data", layer: "sf_boundary"
## with 1 features
## It has 3 fields
sfboundary_lonlat <- spTransform(sfboundary, CRS("+init=epsg:4326"))
We have been mapping data in WGS84 (lon/lat) CRS.
Not a great idea for static maps of larger areas as distortion becomes evident.
Let's explore that distortion and how to address it with tmap.
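To see why the distortion happens, a quick back-of-envelope calculation in base R (approximate figures):

```r
# Why lon/lat maps look stretched: a degree of longitude spans fewer
# kilometers as you move away from the equator, but a plot treats
# degrees of lon and lat as equal-sized units
lat <- 37.77                                   # approx. latitude of San Francisco
km_per_deg_lat <- 111.32                       # roughly constant everywhere
km_per_deg_lon <- cos(lat * pi / 180) * 111.32 # shrinks with latitude

km_per_deg_lon  # about 88 km, noticeably less than 111
```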
Let's load and map data for US states.
Data are in the file data/us_states_pop.shp
Use readOGR to load data/us_states_pop.shp into an sp object named us_states
us_states <- readOGR("./data", "us_states_pop")
## OGR data source with driver: ESRI Shapefile
## Source: "./data", layer: "us_states_pop"
## with 49 features
## It has 4 fields
qtm(us_states)
Review us_states with the sp commands we used earlier and / or explore in the Environment window.
What type of sp object is us_states?
How many features does it contain?
How many attributes describe those features?
What is the CRS?
tm_shape(us_states) + tm_polygons(col="grey", border.col = "white")
Notice anything odd about the shape of the USA?
tm_shape(us_states, projection="+init=epsg:5070") + tm_polygons(col="grey", border.col = "white")
tm_shape(us_states, projection="+init=epsg:5070") + tm_polygons(col="grey", border.col = "white") + tm_shape(us_states) + tm_borders(col="purple")
Also called On-the-fly reprojection in ArcGIS & QGIS
Very cool!
BUT, if you want to use data in a different CRS it is best to transform it.
Transform the us_states data to the USA Contiguous Albers CRS (5070),
Save output as a new SpatialPolygonsDataFrame called us_states_5070
us_states_5070 <- spTransform(us_states, CRS("+init=epsg:5070"))
tm_shape(us_states_5070) + tm_polygons(col="beige") + tm_shape(us_states) + tm_borders(col="purple")
Also called thematic maps or data maps
Use data values to determine symbology
Use symbology to convey data values / meaning
Important for all phases of the research process, from exploratory analysis through reporting results.
Show location
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="black", size=.15)
Show location by feature type - distinguished by color or symbol
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="Neighborhood", size=.15, legend.show = F)
Color Palettes
Data Classification
RColorBrewer is widely used to select a pre-defined color palette.
RColorBrewer is used by ggplot as well as tmap and other plotting libraries.
For more info see ?RColorBrewer or colorbrewer.org
library(RColorBrewer)
There are three types of color palettes
qualitative
sequential
diverging
See ?brewer.pal for details
Complementary colors that emphasize different categories, not magnitudes or order.
display.brewer.all(type="qual")
Light to dark shades of the same or complementary colors to imply order, e.g. higher to lower ranks or values.
display.brewer.all(type="seq")
Two contrasting sequential color palettes at either end, with lighter colors in the middle, to emphasize outliers and reveal trends.
display.brewer.all(type="div")
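Under the hood a palette is just a character vector of hex color codes. A minimal base R sketch (colorRampPalette stands in here for RColorBrewer's hand-tuned predefined palettes):

```r
# Build a 5-step sequential ramp from light to dark; RColorBrewer's
# brewer.pal(5, "BuPu") returns the same kind of object - a character
# vector of hex codes - just with carefully designed colors
seq_pal <- colorRampPalette(c("lightyellow", "darkblue"))(5)
seq_pal  # five "#RRGGBB" strings
```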
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="totvalue", palette="BuPu", size=.15) + tm_layout(inner.margins=c(.05,.05, .15, .15))
Redo the previous map of totvalue using a different sequential color palette.
Redo the Neighborhood map with a sequential color palette.
Redo the totvalue map with a qualitative color palette.
Redo the map of totvalue using a different sequential color palette.
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="totvalue", palette="YlGnBu", size=.15)
Redo the Neighborhood map with a sequential color palette.
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="Neighborhood", size=.15, legend.show = T, palette="BuPu")
Redo the map of totvalue using a different sequential color palette.
tm_shape(sfboundary_lonlat) + tm_polygons(col="white", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="totvalue", palette="Set1", size=.15)

Example from "Bad" Maps, or Carto-Ineptitude: Pandora's favorite electronic dance music tracks by state.
Symbology is associated with data in a few ways:
all one color, size or shape
unique symbol for each value
Unclassified maps scale the full range of data values to the color palette.
Great for exploring trends and outliers,
but hard to interpret specific data values.
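A sketch of that rescaling in base R (toy values; this is conceptually what a continuous, unclassified style does):

```r
# Unclassified symbology: linearly rescale values to [0, 1], then index
# into a color ramp - every distinct value gets its own shade
totvals <- c(250000, 800000, 2000000)  # made-up property values
scaled <- (totvals - min(totvals)) / (max(totvals) - min(totvals))

ramp <- colorRampPalette(c("lightyellow", "darkblue"))(100)
cols <- ramp[pmax(1, ceiling(scaled * 100))]
```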
The eye can only differentiate a few colors.
Numerical data values are grouped into bins using a classification method - usually 5 bins
The choice of classification method greatly influences the appearance of the map - and thus the interpretation of the data
Most mapping software uses quantile classification by default - it assigns an equal number of observations to each color bin.
Common methods for binning data into a set of classes include:
equal interval: classes the data into bins of equal data ranges, e.g. 10-20, 20-30, 30-40.
quantile: classes the data into bins with an equal number of data observations. This is the default for most mapping software.
fisher/jenks/natural breaks: classes data into bins that minimize within-group variance and maximize between-group variance.
standard deviation: emphasizes outliers by classing data into bins based on standard deviations around the mean, e.g. -2 to +2 SDs.
manual: class breaks are hard coded by the data analyst
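The difference between the first two methods is easy to see in base R (toy, skewed values standing in for something like totvalue):

```r
# Ten skewed values - like housing prices: most small, a few huge
home_values <- c(1, 2, 2, 3, 4, 5, 8, 15, 40, 100)

# Equal interval: 5 bins of equal width across the data range
equal_breaks <- seq(min(home_values), max(home_values), length.out = 6)

# Quantile: 5 bins with (roughly) equal numbers of observations
quantile_breaks <- quantile(home_values, probs = seq(0, 1, 0.2))

# With skewed data, equal interval piles almost everything into bin 1
table(cut(home_values, equal_breaks, include.lowest = TRUE))
```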
tmap uses the classification methods in the classInt package (see ?classIntervals).
The classification method is set by the style= parameter
See ?tm_polygons or tm_dots for keyword options
tm_shape(sfboundary_lonlat) + tm_polygons(col="grey", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="totvalue", style="quantile", size=0.15)
tm_shape(sfboundary_lonlat) + tm_polygons(col="grey", border.col="black") + tm_shape(sfhomes15_sp) + tm_dots(col="totvalue", size=0.15, style="fisher")
sfhomes15_sp$totvalue2 <- sfhomes15_sp$totvalue
tm_shape(sfboundary_lonlat) + tm_polygons(col="grey", border.col="black") +
tm_shape(sfhomes15_sp) + tm_dots(col=c("totvalue","totvalue2"), size=0.15, style=c("quantile","equal") )
sfhomes15_sp$totvalue2 <- sfhomes15_sp$totvalue
tm_shape(sfboundary_lonlat) + tm_polygons(col="grey", border.col="black") +
tm_shape(sfhomes15_sp) + tm_dots(col=c("totvalue","totvalue2"), size=0.15, style=c("equal","cont") )
Data driven mapping is most powerful when data classification and color palettes are both used effectively.
Create a few maps of totvalue experimenting with different classification schemes and color palettes.
Graduated Color Maps - colors mapped to classified data values
Graduated Symbol Maps - symbol size mapped to classified data values
Proportional Color or Symbol Maps - color or size scaled to all data values
Color areas by data values
Fun name for a type of data map
Sometimes loosely called a heat map or thematic map
Most common type of data map
tm_shape(us_states_5070) + tm_polygons(col="POPULATION")
tm_shape(us_states_5070) + tm_polygons(col="POPULATION") + tm_layout(legend.position=c("left","bottom"))
tm_shape(us_states_5070) + tm_polygons(col="POPULATION", style="jenks")
Make a map that compares Population and popdens using the jenks classification
Change the color palette
Move the legend
What state has the highest population density? Population?
tm_shape(us_states_5070) +
tm_polygons(col=c("POPULATION","popdens"), style="jenks", palette="BuPu" ) +
tm_layout(legend.position=c("left","bottom"))
Often the different sizes of polygons distract from data values
Our eyes are drawn to large areas like CA or Texas, biasing the interpretation.
Often data aggregated to polygons are mapped as points!
What type of map is this?
tm_shape(us_states_5070) + tm_polygons(col="white", border.alpha = 0.5) +
tm_shape(us_states_5070) + tm_symbols(size="POPULATION", style="jenks", col="purple") +
tm_layout(legend.position=c("left","bottom"))
Recreate the popdens map as an interactive tmap.
Then click to find the name of the most densely populated state
?tmap_mode
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(us_states_5070) + tm_polygons(col="white", border.alpha = 0.5) +
tm_shape(us_states_5070) + tm_symbols(col="popdens", style="jenks",
popup.vars=c("NAME","POPULATION","popdens")) +
tm_layout(legend.position=c("left","bottom"))
Return to plot mode
tmap_mode('plot')
## tmap mode set to plotting
You can save static and interactive maps with save_tmap.
See: ?save_tmap for details
map1 <- tm_shape(us_states_5070) + tm_polygons(col="POPULATION", style="jenks")
save_tmap(map1, "tmap_choropleth.png", height=4) # static image file
save_tmap(map1, "tmap_choropleth.html") # interactive web map
See vignette("tmap-modes") for more on interactive maps.
tmap github repo (readme file, demo and example folders)
There are several R packages that output Leaflet maps.
Use the leaflet package for more customized leaflet maps
Highly recommend if you want to make interactive maps.
See the RStudio Leaflet tutorial.
Spatial Analysis begins with an exploration of the data.
Mapping to see its location and distribution
Asking questions of, or querying, your data.
Cleaning & reshaping the data
Applying analysis methods
Mapping analysis results
Repeat as needed
There are two key types of spatial queries: spatial measurement queries (e.g. distance, area) and spatial relationship queries (e.g. inside, intersects).
These types are often combined, e.g. select all properties within 1 km of Coit Tower and compute their mean value.
So far we have explored housing values for the city of San Francisco.
The data set consists of many disaggregated features represented as points.
In this section we will
Use readOGR to load the shapefile sf_pop_by_tracts from the data folder.
Then take a look at the data. What type of sp object is it?
sftracts <- readOGR(dsn="./data", layer="sf_pop_by_tracts")
## OGR data source with driver: ESRI Shapefile
## Source: "./data", layer: "sf_pop_by_tracts"
## with 196 features
## It has 10 fields
class(sftracts)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
proj4string(sftracts)
## [1] "+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0"
summary(sftracts)
## Object of class SpatialPolygonsDataFrame
## Coordinates:
##          min        max
## x -123.01392 -122.32801
## y   37.69274   37.86334
## Is projected: FALSE
## proj4string :
## [+proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0]
## Data attributes:
##  STATEFP  COUNTYFP     TRACTCE                  AFFGEOID
##  06:196   075:196   010100 :  1   1400000US06075010100:  1
##                     010200 :  1   1400000US06075010200:  1
##                     010300 :  1   1400000US06075010300:  1
##                     010400 :  1   1400000US06075010400:  1
##                     010500 :  1   1400000US06075010500:  1
##                     010600 :  1   1400000US06075010600:  1
##                     (Other):190   (Other)             :190
##          GEOID          NAME     LSAD        ALAND
##  06075010100:  1   101    :  1   CT:196   Min.   :  56564
##  06075010200:  1   102    :  1            1st Qu.: 294618
##  06075010300:  1   103    :  1            Median : 409830
##  06075010400:  1   104    :  1            Mean   : 619651
##  06075010500:  1   105    :  1            3rd Qu.: 660393
##  06075010600:  1   106    :  1            Max.   :6108837
##  (Other)    :190   (Other):190
##      AWATER              pop14
##  Min.   :        0   Min.   :    0
##  1st Qu.:        0   1st Qu.: 3100
##  Median :        0   Median : 4111
##  Mean   :  2082654   Mean   : 4230
##  3rd Qu.:        0   3rd Qu.: 5080
##  Max.   :247501271   Max.   :12391
head(sftracts@data)
##   STATEFP COUNTYFP TRACTCE             AFFGEOID       GEOID   NAME LSAD
## 0      06      075  010700 1400000US06075010700 06075010700    107   CT
## 1      06      075  012201 1400000US06075012201 06075012201 122.01   CT
## 2      06      075  013102 1400000US06075013102 06075013102 131.02   CT
## 3      06      075  013500 1400000US06075013500 06075013500    135   CT
## 4      06      075  015500 1400000US06075015500 06075015500    155   CT
## 5      06      075  016300 1400000US06075016300 06075016300    163   CT
##    ALAND AWATER pop14
## 0 183170      0  5311
## 1  92048      0  4576
## 2 237886      0  2692
## 3 235184      0  2592
## 4 311339      0  3918
## 5 245867      0  4748
plot(sftracts) # or qtm(sftracts)
You can subset a SPDF just like a DF
Select all features with population > 0
Then replot the map
sftracts <- subset(sftracts, pop14 > 0)
plot(sftracts)
We need to spatially join the sftracts and sfhomes15_sp to answer this.
A spatial join associates rows of data in one object with rows in another object based on the spatial relationship between the two objects.
A spatial join is based on the comparison of two sets of geometries in the same coordinate space.
This is called a spatial overlay.
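Conceptually, the primitive behind a point-polygon overlay is a point-in-polygon test. A toy ray-casting version in base R (sp and rgeos implement this robustly and at scale):

```r
# Ray casting: count how many polygon edges a horizontal ray from the
# point crosses; an odd count means the point is inside
point_in_poly <- function(px, py, vx, vy) {
  n <- length(vx)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    crosses <- (vy[i] > py) != (vy[j] > py)
    if (crosses &&
        px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]) {
      inside <- !inside
    }
    j <- i
  }
  inside
}

# Unit square: one point inside, one outside
point_in_poly(0.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # TRUE
point_in_poly(1.5, 0.5, c(0, 1, 1, 0), c(0, 0, 1, 1))  # FALSE
```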
Spatial overlay operations in R are implemented using the over function in the sp and rgeos libraries.
Point-polygon overlays use sp::over
SpatialLines objects, or pairs of SpatialPolygons require package rgeos, and use gIntersects.
That's likely more detail than you need right now but the point here is that rgeos is the go-to library for vector geometry processing in R.
You can interpret over(x, y) as: at the spatial locations of x, retrieve the attributes of the first matching feature in y.
You can interpret over(x, y, returnList=TRUE) as: at the spatial locations of x, retrieve the attributes of all matching features in y, returned as a list.
See ?over for details.
In what tract is each SF property located?
homes_with_tracts <- over(sfhomes15_sp, sftracts)
Did it work? If not, why not?
The over function, like almost all spatial analysis functions, requires that both data sets be spatial objects (they are) with the same coordinate reference system (CRS). Let's investigate.
# What is the CRS of the property data?
proj4string(sfhomes15_sp)

# What is the CRS of the census tracts?
proj4string(sftracts)
Both data sets are in a geographic CRS, but the CRSs differ: WGS84 (sfhomes15_sp) and NAD83 (sftracts).
Transform the CRS of sftracts to that of sfhomes15_sp
Then test that they are identical
# Transform the CRS for tracts to be the same as that for sfhomes15_sp
sftracts2 <- spTransform(sftracts, CRS(proj4string(sfhomes15_sp)))

# Make sure the CRSs are now the same
proj4string(sftracts2) == proj4string(sfhomes15_sp)
## [1] TRUE
Now let's try that overlay operation again
# Now try the overlay operation again
# Identify the tract for each property in sfhomes15_sp
homes_with_tracts <- over(sfhomes15_sp, sftracts2)
What is our output? Does it answer our question?
What type of data object did the over function return?
head(homes_with_tracts)  # take a look at the output
class(homes_with_tracts)
nrow(homes_with_tracts)
nrow(sftracts2)
nrow(sfhomes15_sp)
Our output homes_with_tracts is a data frame that contains:
- the id of each property in sfhomes15_sp
- all of the columns from sftracts2@data, including the census tract id (GEOID)
So we are close to answering our question.
But for the data to be useful we need to link (or join) the GEOID to the sfhomes15_sp object.
# Keep only the GEOID column
homes_with_tracts <- homes_with_tracts[c("GEOID")]
We can use the base cbind (column bind) command to join the data frame to the SpatialPointsDataFrame.
The successful use of this function requires the data to be in the same order!
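A toy illustration of why row order matters (plain data frames with made-up values):

```r
# cbind matches rows purely by position - there is no key lookup
homes <- data.frame(id = c("a", "b", "c"),
                    totvalue = c(100, 200, 300),
                    stringsAsFactors = FALSE)
tract_ids <- data.frame(GEOID = c("t1", "t2", "t3"),
                        stringsAsFactors = FALSE)  # one row per home, same order

joined <- cbind(homes, tract_ids)  # correct: rows line up

# If one side had been reordered, cbind would silently attach wrong GEOIDs
shuffled <- tract_ids[c(3, 1, 2), , drop = FALSE]
wrong <- cbind(homes, shuffled)    # no error, but home "a" now gets "t3"
```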
# NOTE - binding to the @data slot!
sfhomes15_sp@data <- cbind(sfhomes15_sp@data, homes_with_tracts)

# Review and note the change
# head(sfhomes15_sp@data)
Map the data in tmap interactive mode
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(sftracts2) + tm_polygons(col="beige") + tm_shape(sfhomes15_sp) + tm_dots(col="red")
Did the over command successfully associate the census tract GEOID with each property?
If yes, you now could link the property data to census demographic data by GEOID.
We just did a type of spatial overlay query called a point-in-polygon query.
We asked "In what tract is each property located?".
Data linkage via space!
We now know the tract for each property.
Now let's think about this question from the tract perspective.
Let's ask the question: what is the mean property value in each census tract?
Since we joined GEOID to each property, we can use the non-spatial aggregate function to compute the mean totvalue for each GEOID.
mean_totvalue_by_tract <- aggregate(totvalue ~ GEOID, sfhomes15_sp, mean)

# Take a look
head(mean_totvalue_by_tract)
##         GEOID totvalue
## 1 06075010100  1315017
## 2 06075010200  1182728
## 3 06075010300  1739681
## 4 06075010400  1199131
## 5 06075010500  1537299
## 6 06075010600  1657004
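The formula interface of aggregate works the same on any data frame; a toy version of the call above (made-up values):

```r
# Mean totvalue per GEOID - exactly the pattern used on sfhomes15_sp
toy <- data.frame(GEOID = c("t1", "t1", "t2"),
                  totvalue = c(100, 300, 500),
                  stringsAsFactors = FALSE)
toy_means <- aggregate(totvalue ~ GEOID, toy, mean)
toy_means  # one row per GEOID with the group mean
```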
Rename the totvalue column so that it is clear that it is the mean for the tract!
colnames(mean_totvalue_by_tract) <- c("GEOID","mean_totvalue")
head(mean_totvalue_by_tract)
##         GEOID mean_totvalue
## 1 06075010100       1315017
## 2 06075010200       1182728
## 3 06075010300       1739681
## 4 06075010400       1199131
## 5 06075010500       1537299
## 6 06075010600       1657004
However, we can't map our data frame of mean totvalue by GEOID - it has no geometry.
We can use sp::merge to join the mean_totvalue_by_tract DF to the sftracts2 SPDF.
We should make sure that
the number of rows in sftracts2 and mean_totvalue_by_tract are the same
they share a column of common values - GEOID
When we join two data objects based on values in a common column it is called an attribute join, or data table join.
sp::merge makes this simple for sp objects with @data slots.
sftracts2<- merge(sftracts2, mean_totvalue_by_tract,
by.x="GEOID", by.y="GEOID")
IMPORTANT: merge the DF to the SPDF itself, NOT directly to the @data slot!
## Don't do this:
sftracts2@data <- merge(sftracts2@data, mean_totvalue_by_tract,
by.x="GEOID", by.y="GEOID")
head(sftracts2@data)
##          GEOID STATEFP COUNTYFP TRACTCE             AFFGEOID   NAME LSAD
## 7  06075010700      06      075  010700 1400000US06075010700    107   CT
## 20 06075012201      06      075  012201 1400000US06075012201 122.01   CT
## 36 06075013102      06      075  013102 1400000US06075013102 131.02   CT
## 40 06075013500      06      075  013500 1400000US06075013500    135   CT
## 45 06075015500      06      075  015500 1400000US06075015500    155   CT
## 54 06075016300      06      075  016300 1400000US06075016300    163   CT
##     ALAND AWATER pop14 mean_totvalue
## 7  183170      0  5311     1862440.8
## 20  92048      0  4576      614217.4
## 36 237886      0  2692     1286435.1
## 40 235184      0  2592     1493924.9
## 45 311339      0  3918      873522.1
## 54 245867      0  4748     1446370.5
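Why merging into @data directly is dangerous: base merge() reorders (and can drop) rows, while the polygons keep their original order. A toy demonstration with plain data frames (made-up ids):

```r
# The attribute row order stands in for polygon order: "t3" is polygon 1
tract_attrs <- data.frame(GEOID = c("t3", "t1", "t2"),
                          pop = c(30, 10, 20),
                          stringsAsFactors = FALSE)
means <- data.frame(GEOID = c("t1", "t2", "t3"),
                    mean_totvalue = c(1, 2, 3),
                    stringsAsFactors = FALSE)

m <- merge(tract_attrs, means, by = "GEOID")
m$GEOID  # sorted by key - no longer matches the original polygon order
```

sp::merge on the SPDF keeps the rows aligned with the geometries for you.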
Create an interactive tmap of census tracts colored by mean_totvalue
tm_shape(sftracts2) + tm_polygons(col="mean_totvalue", style="jenks")
Above we asked "what is the census tract id for each property?"
We then used the non-spatial aggregate function to calculate the mean totvalue for each tract.
Finally, we did a table merge to join these results to the census tracts.
However, we can ask the same from the tract perspective using sp::aggregate
Compute mean totvalue by census tract
tracts_with_property_count <- aggregate(x = sfhomes15_sp["totvalue"],
by = sftracts2, FUN = mean)
class(tracts_with_property_count)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(tracts_with_property_count@data)
##    totvalue
## 0 1862440.8
## 1  614217.4
## 2 1286435.1
## 3 1493924.9
## 4  873522.1
## 5 1446370.5
plot(tracts_with_property_count)
# Copy the tract ids over so we can compare the two approaches
tracts_with_property_count$GEOID <- sftracts2$GEOID
Did both methods get the same result?
Check the mean totvalue for the same GEOID in each SPDF
tracts_with_property_count[tracts_with_property_count$GEOID == "06075010700", ]$totvalue
## [1] 1862441
sftracts2[sftracts2$GEOID == "06075010700",]$mean_totvalue
## [1] 1862441
Many methods of spatial analysis are based on distance queries.
For example, point pattern analysis considers the distance between features to determine whether or not they are clustered.
We can also use distance as a way to select features spatially.
Let's compute the mean property value within 1 km of Coit Tower
First, we need to know where Coit Tower is. How can we find that out?
library(ggmap)
library(ggplot2)
coit_tower <- geocode('Coit Tower, San Francisco, CA')
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Coit%20Tower%2C%20San%20Francisco%2C%20CA
In order to select properties within 1 km of Coit Tower we first create a 1 km radius buffer polygon around the Coit Tower point.
We then do a point-in-polygon operation to either count the number of properties within the buffer or compute their mean totvalue.
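In a projected CRS with units of meters (like EPSG:26910), "within 1 km" reduces to a Euclidean distance test. A toy sketch with made-up coordinates:

```r
# Coit Tower stand-in and three fake properties, coordinates in meters
cx <- 553000
cy <- 4182700
pts <- data.frame(x = c(553100, 554500, 553900),
                  y = c(4182800, 4182700, 4182650),
                  totvalue = c(800000, 1200000, 950000))

d <- sqrt((pts$x - cx)^2 + (pts$y - cy)^2)  # distance to center, meters
within_1km <- pts[d <= 1000, ]              # buffer-select by distance

mean(within_1km$totvalue)
```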
rgeos is another powerful and widely used library for working with geospatial data.
It is the muscle for:
- creating new geometries from existing ones
- calculating spatial metrics like area, length, and distance
- determining the spatial relationship between two geometries
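One of those metrics, polygon area, comes from the shoelace formula; a toy base R version of the computation that rgeos performs robustly for projected coordinates:

```r
# Shoelace formula: sum the signed cross products of consecutive vertices
poly_area <- function(x, y) {
  n <- length(x)
  nxt <- c(2:n, 1)  # index of the next vertex, wrapping around
  abs(sum(x * y[nxt] - x[nxt] * y)) / 2
}

poly_area(c(0, 4, 4, 0), c(0, 0, 3, 3))  # 4 x 3 rectangle -> area 12
```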
We can use the rgeos::gBuffer function to create our buffer polygon
coordinates(coit_tower) <- c("lon","lat")
proj4string(coit_tower)<- CRS("+init=epsg:4326")
coit_tower_utm <- spTransform(coit_tower, CRS("+init=epsg:26910"))
library(rgeos)
coit_km_buffer <- gBuffer(coit_tower_utm, width=1000)
tm_shape(sfhomes15_sp) + tm_dots(col="blue") + tm_shape(coit_km_buffer) + tm_borders(col="red", lwd=2) + tm_shape(coit_tower_utm) + tm_dots(col="red")
Now that we have our buffer polygon, how can we compute the mean totvalue of properties within the buffer?
# This first attempt fails - the buffer is in UTM (meters) but the
# homes are in lon/lat, and the CRSs must match
buff_mean <- aggregate(x = sfhomes15_sp["totvalue"],
                       by = coit_km_buffer, FUN = mean)

# So transform the buffer to the CRS of the homes, then try again
coit_buffer_lonlat <- spTransform(coit_km_buffer,
                                  CRS(proj4string(sfhomes15_sp)))

buff_mean <- aggregate(x = sfhomes15_sp["totvalue"],
                       by = coit_buffer_lonlat, FUN = mean)
What is the mean property value within 1KM of Coit Tower in 2015?
View(buff_mean)
That was a whirlwind tour of just some of the methods of spatial analysis.
There was a lot we couldn't cover.
Raster data is another major topic - the raster package is the key there.
library(knitr)
purl("r-geospatial-workshop-f2017-pt2.Rmd", output = "scripts/r-geospatial-workshop-f2017-pt2.r", documentation = 1)